Abstract

This paper describes a new online convex op- 
timization method which incorporates a fam- 
ily of candidate dynamical models and es- 
tablishes novel tracking regret bounds that 
scale with the comparator's deviation from 
the best dynamical model in this family. Pre- 
vious online optimization methods are de- 
signed to have a total accumulated loss com- 
parable to that of the best comparator se- 
quence, and existing tracking or shifting re- 
gret bounds scale with the overall variation of 
the comparator sequence. In many practical 
scenarios, however, the environment is non- 
stationary and comparator sequences with 
small variation are quite weak, resulting in 
large losses. The proposed Dynamic Mirror 
Descent method, in contrast, can yield low 
regret relative to highly variable comparator 
sequences by both tracking the best dynam- 
ical model and forming predictions based on 
that model. This concept is demonstrated 
empirically in the context of sequential com- 
pressive observations of a dynamic scene and 
tracking a dynamic social network. 

1. Introduction 

In a variety of large-scale streaming data problems, 
ranging from motion imagery formation to network 
analysis, dynamical models of the environment play a 
key role in performance. Classical stochastic filtering 
methods such as Kalman or particle filters or Bayesian 
updates (Bain & Crisan, 2009) readily exploit dynami- 
cal models for effective prediction and tracking perfor- 
mance. However, classical methods are also limited in 

Proceedings of the 30th International Conference on Machine
Learning, Atlanta, Georgia, USA, 2013. JMLR:
W&CP volume 28. Copyright 2013 by the author(s).



their applicability because (a) they typically assume 
an accurate, fully known dynamical model and (b) 
they rely on strong assumptions regarding a genera- 
tive model of the observations. Some techniques have 
been proposed to learn the dynamics (Xie et al., 1994; 
Theodor & Shaked, 1996), but the underlying model 
still places heavy restrictions on the nature of the data. 
Performance analysis of these methods usually does 
not address the impact of "model mismatch", where
the generative models are incorrectly specified. 

A contrasting class of prediction methods is based 
on an "individual sequence" or "universal prediction"
(Merhav & Feder, 1998) perspective; these strive
to perform provably well on any individual observation
sequence. In particular, online convex programming
methods (Nemirovsky & Yudin, 1983; Beck &
Teboulle, 2003; Zinkevich, 2003; Cesa-Bianchi & Lugosi,
2006) rely on the gradient of the instantaneous
loss of a predictor to update the prediction for the next 
data point. The aim of these methods is to ensure that 
the per-round performance approaches that of the best 
offline method with access to the entire data sequence. 
This approach allows one to sidestep challenging is- 
sues associated with statistically dependent or non- 
stochastic observations, misspecified generative mod- 
els, and corrupted observations. This framework is 
limited as well, however, because performance bounds 
are typically relative to either static or piecewise con- 
stant comparators and do not adequately reflect adap- 
tivity to a dynamic environment. 

This paper describes a novel framework for prediction 
in the individual sequence setting which incorporates 
dynamical models - effectively a novel combination of 
state updating from stochastic filter theory and online 
convex optimization from universal prediction. We establish
tracking regret bounds for our proposed algorithm,
Dynamic Mirror Descent (DMD), which scale
with the deviation of a comparator sequence from a sequence
evolving with a known dynamic. These bounds
simplify to previously shown bounds when there are
no dynamics. We further establish tracking regret
bounds for another algorithm, Dynamic Fixed Share 
(DFS), which scale with the deviation of a compara- 
tor sequence from a sequence evolving with the best 
sequence of dynamical models. While our methods and 
theory apply in a broad range of settings, we are par- 
ticularly interested in the setting where the dimen- 
sionality of the parameter to be estimated is very high 
relative to the data volume. In this regime, the incor- 
poration of both dynamical models and sparsity regu- 
larization plays a key role. With this in mind, we focus 
on a class of methods which incorporate regularization 
as well as dynamical modeling. The role of regularization,
particularly sparsity regularization, is increasingly
well understood in batch settings and has resulted
in significant gains in ill-posed and data-starved
settings (Banerjee et al., 2008; Ravikumar et al., 2010;
Candes et al., 2006; Belkin & Niyogi, 2003).

In our experiments, we consider reconstructing motion 
imagery from sequential observations collected with a 
compressive camera and estimating the dynamic social 
network underlying over 200 years of U.S. Senate roll- 
call data. There has been significant recent interest in 
using models of temporal structure to improve time se- 
ries estimation from compressed sensing observations 
(Angelosante et al., 2009; Vaswani & Lu, 2010) or for 
time-varying networks (Snijders, 2001; Kolar et al.,
2010); the associated algorithms, however, are typi- 
cally batch methods poorly suited to large quantities 
of streaming data. This paper strives to bridge that 
gap. 

2. Problem formulation 

Let $\mathcal{X}$ denote the domain of our observations, and let
$\Theta$ denote a convex feasible set. Given sequentially arriving
observations $x \in \mathcal{X}^\infty$, we wish to construct a
sequence of predictions $\hat{\boldsymbol{\theta}} = (\hat\theta_1, \hat\theta_2, \ldots) \in \Theta^\infty$, where
$\hat\theta_t$ may depend only on the currently available observations
$\boldsymbol{x}_{t-1} = (x_1, \ldots, x_{t-1})$. We pose our problem
as a dynamic game between a Forecaster and the Environment.
At time $t$, the Forecaster computes a prediction
$\hat\theta_t$, and the Environment generates the observation
$x_t$. The Forecaster then experiences the loss
$\ell_t(\hat\theta_t)$, defined as follows. Let $\mathcal{F}$ and $\mathcal{R}$ denote families
of convex functions, and let $f_t(\cdot) = f(\cdot, x_t) \in \mathcal{F}$
be a cost function measuring the accuracy of the prediction
$\hat\theta_t$ with respect to the datum $x_t$. Similarly,
let $r(\cdot) \in \mathcal{R}$ be a regularization term which does not
change over time; for instance, $r$ might promote sparsity
or other low-dimensional structure in the potentially
high-dimensional space $\Theta$. The loss at time $t$ is

$$\ell_t(\theta) \triangleq f_t(\theta) + r(\theta),$$

where

$$\ell_t \in \mathcal{L} \triangleq \{\ell = f + r : f \in \mathcal{F},\ r \in \mathcal{R}\}.$$

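The task facing the Forecaster is to create a new prediction
$\hat\theta_{t+1}$ based on the previous prediction and the
new observation, with the goal of minimizing loss at
the next time step. We characterize the efficacy of
$\hat{\boldsymbol{\theta}}_T = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_T) \in \Theta^T$ relative to a comparator
sequence $\boldsymbol{\theta}_T = (\theta_1, \theta_2, \ldots, \theta_T) \in \Theta^T$ as follows:

Definition 1 (Regret). The regret of $\hat{\boldsymbol{\theta}}_T$ with respect
to a comparator $\boldsymbol{\theta}_T \in \Theta^T$ is

$$R_T(\boldsymbol{\theta}_T) \triangleq \sum_{t=1}^{T} \ell_t(\hat\theta_t) - \sum_{t=1}^{T} \ell_t(\theta_t).$$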

Previous work proposed algorithms which yielded regret
of $O(\sqrt{T})$ for static comparators, where $\theta_t = \theta$ for
all $t$. Our goal is to develop an online convex optimization
algorithm with low regret relative to a broad family
of time-varying comparator sequences. In particular,
our main result is an algorithm which incorporates
a dynamical model, denoted $\Phi_t$, which admits a regret
bound of the form $O(\sqrt{T}[1 + \sum_t \|\theta_{t+1} - \Phi_t(\theta_t)\|])$. This
bound scales with the comparator sequence's deviation
from the dynamical model $\Phi_t$ - a stark contrast to previous
tracking regret bounds which are only sublinear
for comparators which change slowly with time or at
a small number of distinct time instances.

3. Static, tracking, shifting, and 
adaptive regret 

In much of the online learning literature, the com- 
parator sequence is constrained to be static or time- 
invariant. In this paper we refer to the regret with 
respect to a static comparator as static regret: 
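Definition 2 (Static regret). The static regret of $\hat{\boldsymbol{\theta}}_T$
is

$$R_T \triangleq \sum_{t=1}^{T} \ell_t(\hat\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^{T} \ell_t(\theta).$$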

Static regret bounds are useful in characterizing how 
well an online algorithm performs relative to, say, 
a loss-minimizing batch algorithm with access to all 
the data simultaneously. More generally, static regret
bounds compare the performance of the algorithm
against a static point, $\theta^*$, which can be chosen with full
knowledge of the data.

However, this form of analysis fails to illuminate the 
performance of online algorithms in dynamic settings 
where a static comparator is inappropriate. Performance
relative to a temporally-varying or dynamic
comparator sequence has been studied previously in
the literature in the context of tracking regret, shifting
regret (Herbster & Warmuth, 2001; Cesa-Bianchi
et al., 2012), and the closely-related concept of adaptive
regret (Littlestone & Warmuth, 1994; Hazan &
Seshadhri, 2009).

In particular, tracking regret compares the output
of the online algorithm to a sequence of points
$\theta_1, \theta_2, \ldots, \theta_T$ which can be chosen collectively with full
knowledge of the data. This is a fair comparison for
a batch algorithm that detects and fits to drift in the
data, instead of fitting a single point. Frequently, in
order to bound tracking regret there needs to be a measure
of the complexity of the sequence $\theta_1, \theta_2, \ldots, \theta_T$.
Typically, this complexity is characterized via a measure
of the temporal variability of the sequence, such
as

$$\sum_{t=1}^{T} \|\theta_{t+1} - \theta_t\|.$$

If this complexity is allowed to be very high, we could 
imagine that the comparator series would fit the series 
of losses closely and hence generalize poorly. Con- 
versely if this complexity is restricted to be 0, the 
tracking regret becomes equivalent to static regret. 
Tracking and shifting regret are the same concept, al- 
though the term shifting regret is used more in the 
"experts" setting, while tracking regret tends to be a 
more generic term. 

Adaptive regret is a related concept to tracking regret.
Instead of measuring accumulated regret over the entire
series, however, adaptive regret measures accumulated
loss over an arbitrary time interval of length $\tau$,
and measures performance against a static comparator
chosen optimally on this interval:

$$R_\tau \triangleq \max_{[r,s] \subset [1,T];\ s+1-r \le \tau} \left[ \sum_{t=r}^{s} \ell_t(\hat\theta_t) - \min_{\theta \in \Theta} \sum_{t=r}^{s} \ell_t(\theta) \right].$$

This is a valuable metric as it assures that a process 
will have low loss not just globally, but also at any 
given moment. Intuitively we can see that an algo- 
rithm with low adaptive regret on any interval should 
also have low tracking regret and vice versa. The re- 
lationship between the two has been formally shown 
(Cesa-Bianchi et al., 2012). 

In this paper, we present tracking/shifting regret 
bounds which rely on a much more general notion of 
the complexity of a comparator sequence. In particular,
we could measure the complexity of a sequence in
terms of how much it deviates from a given dynamical
model, denoted $\Phi_t$:

$$V_\Phi(\boldsymbol{\theta}_T) \triangleq \sum_{t=1}^{T} \|\theta_{t+1} - \Phi_t(\theta_t)\|.$$

Ultimately, we consider a family of dynamical models,
and we measure the complexity of a comparator in 
terms of how much it deviates from the best sequence 
of dynamical models in this family. (These concepts 
will be formalized and detailed in the next two sec- 
tions.) 

It is intuitively satisfying that this measure appears in 
the bound. Firstly, if the comparator actually follows 
the dynamics, we would imagine this complexity to be 
very small, leading to low tracking regret. This fact
holds whether $\Phi_t$ is part of the generative model for the
observations or not. Secondly, we can get a dynamic
analog of static regret, where we enforce $V_\Phi(\boldsymbol{\theta}_T) = 0$.
This is equivalent to saying that the batch comparator
is fitting the best single trajectory using $\Phi_t$ instead of
the best single point. Using this, we would recover a
bound analogous to a static regret bound in a station- 
ary setting. 
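
To make the contrast between the two complexity measures concrete, the following minimal sketch compares the temporal variation of a comparator with its deviation from a dynamical model. The rotation model `rotate`, the helper names, and the horizon of 200 steps are illustrative assumptions, not taken from the experiments in this paper; the sums run over the consecutive pairs available in a finite sequence.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def variation(thetas):
    """Temporal variation: sum over t of ||theta_{t+1} - theta_t||."""
    return sum(
        norm([b - a for a, b in zip(cur, nxt)])
        for cur, nxt in zip(thetas, thetas[1:])
    )

def deviation_from_dynamics(thetas, phi):
    """Deviation from a model: sum over t of ||theta_{t+1} - phi(theta_t)||."""
    return sum(
        norm([b - a for a, b in zip(phi(cur), nxt)])
        for cur, nxt in zip(thetas, thetas[1:])
    )

def rotate(theta, angle=0.1):
    """Hypothetical dynamical model: rotate a 2-D point by a fixed angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * theta[0] - s * theta[1], s * theta[0] + c * theta[1]]

# A comparator that follows the rotation exactly for 200 steps.
thetas = [[1.0, 0.0]]
for _ in range(199):
    thetas.append(rotate(thetas[-1]))

print(variation(thetas))                        # grows with the horizon
print(deviation_from_dynamics(thetas, rotate))  # zero: the comparator obeys the model
```

A bound that scales with the first quantity is weak for this comparator, whereas a bound that scales with the second treats it like a static point.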

Concurrent related work considers online algorithms 
where the data sequence is described by a "predictable 
process" (Rakhlin & Sridharan, 2012). By knowing 
a good estimate for the underlying process, they can 
create a prediction sequence that follows accordingly, 
reducing overall loss. However, they express their results
in terms of a static regret bound (i.e., regret with
respect to a static comparator) with a variation term 
that expresses the deviation of the input data from 
the underlying process. In contrast, we make no as- 
sumptions about the data itself, but instead on the 
comparator series, and form tracking regret bounds. 

4. Online convex optimization 

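One common approach to forming the predictions $\hat\theta_t$,
Mirror Descent (MD) (Nemirovsky & Yudin, 1983;
Beck & Teboulle, 2003), consists of solving the following
optimization problem:

$$\hat\theta_{t+1} = \arg\min_{\theta \in \Theta}\ \eta_t \langle \nabla \ell_t(\hat\theta_t), \theta \rangle + D(\theta \| \hat\theta_t), \tag{2}$$

where $\nabla \ell_t(\theta)$ denotes an arbitrary subgradient of $\ell_t$ at
$\theta$, $D(\theta \| \hat\theta_t)$ is the Bregman divergence between $\theta$ and
$\hat\theta_t$, and $\eta_t > 0$ is a step size parameter. Let $\psi$ denote a
continuously differentiable function that is $\sigma$-strongly
convex with respect to a norm $\|\cdot\|$ on the set $\Theta$ for
some $\sigma > 0$; the Bregman divergence associated with
$\psi$ is defined as

$$D(\theta_1 \| \theta_2) = D_\psi(\theta_1 \| \theta_2) \tag{3a}$$
$$\triangleq \psi(\theta_1) - \psi(\theta_2) - \langle \nabla\psi(\theta_2), \theta_1 - \theta_2 \rangle \tag{3b}$$
$$= D(\theta_3 \| \theta_2) + D(\theta_1 \| \theta_3) + \langle \nabla\psi(\theta_2) - \nabla\psi(\theta_3), \theta_3 - \theta_1 \rangle \tag{3c}$$

for all $\theta_1, \theta_2, \theta_3 \in \Theta$, and the strong convexity of $\psi$
implies

$$D(\theta_1 \| \theta_2) \ge \frac{\sigma}{2} \|\theta_1 - \theta_2\|^2.$$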
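
As a small numerical illustration of the Bregman divergence and the three-point identity (3c), the sketch below evaluates both for two common choices of $\psi$. The function names and test points are hypothetical; the only facts assumed are that the half squared Euclidean norm yields half the squared distance and that negative entropy yields the Kullback-Leibler divergence on the probability simplex.

```python
import math

def bregman(psi, grad_psi, x, y):
    """D(x||y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(y), x, y))
    return psi(x) - psi(y) - inner

def sq(v):
    """psi = half squared Euclidean norm (1-strongly convex w.r.t. the 2-norm)."""
    return 0.5 * sum(a * a for a in v)

def sq_grad(v):
    return list(v)

def negent(v):
    """psi = negative entropy, defined for vectors with positive entries."""
    return sum(a * math.log(a) for a in v)

def negent_grad(v):
    return [math.log(a) + 1.0 for a in v]

def three_point_gap(psi, grad_psi, t1, t2, t3):
    """Left side minus right side of (3c); zero up to rounding error."""
    cross = sum(
        (g2 - g3) * (c - a)
        for g2, g3, c, a in zip(grad_psi(t2), grad_psi(t3), t3, t1)
    )
    rhs = bregman(psi, grad_psi, t3, t2) + bregman(psi, grad_psi, t1, t3) + cross
    return bregman(psi, grad_psi, t1, t2) - rhs

# Three arbitrary points on the probability simplex.
t1, t2, t3 = [0.2, 0.3, 0.5], [0.5, 0.25, 0.25], [0.1, 0.6, 0.3]
print(bregman(sq, sq_grad, t1, t2))          # half squared Euclidean distance
print(bregman(negent, negent_grad, t1, t2))  # KL divergence between t1 and t2
print(three_point_gap(negent, negent_grad, t1, t2, t3))
```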

The MD approach is a generalization of online learning 
algorithms such as online gradient descent (Zinkevich, 
2003) and weighted majority (Littlestone & Warmuth, 
1994). Several recently proposed methods consider the 
data-fit term separately from the regularization term 
(Duchi et al., 2010; Xiao, 2010; Langford et al., 2009). 
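For instance, consider Composite Objective Mirror Descent
(COMID) (Duchi et al., 2010):

$$\hat\theta_{t+1} = \arg\min_{\theta \in \Theta}\ \eta_t \langle \nabla f_t(\hat\theta_t), \theta \rangle + \eta_t r(\theta) + D(\theta \| \hat\theta_t). \tag{4}$$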

This formulation is helpful when the regularization
function $r(\theta)$ promotes sparsity in $\theta$, and helps ensure
that the individual $\hat\theta_t$ are indeed sparse, rather
than approximately sparse as are the solutions to the
MD formulation. The regret of this approach has previously
been characterized as follows:

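Theorem 3 (Static regret for COMID (Duchi
et al., 2010)). Let $G_f \triangleq \max_{\theta \in \Theta, f \in \mathcal{F}} \|\nabla f(\theta)\|$, $D_{\max} \triangleq
\max_{\theta_1, \theta_2 \in \Theta} D(\theta_1 \| \theta_2)$, and assume that $\theta = \theta_1 =
\theta_2 = \cdots = \theta_T$. If $r(\hat\theta_1) = 0$ and $\eta_t =
(2\sigma D_{\max})^{1/2} / (G_f \sqrt{T})$, then

$$R_T(\boldsymbol{\theta}_T) \le G_f \left( \frac{2 T D_{\max}}{\sigma} \right)^{1/2}.$$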
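
For intuition about update (4), the sketch below works through one special case: assuming $\psi(\theta) = \frac{1}{2}\|\theta\|_2^2$, $r(\theta) = \tau\|\theta\|_1$, and an unconstrained feasible set, the minimizer of (4) is a gradient step followed by entrywise soft thresholding at level $\eta_t \tau$. The names `soft_threshold` and `comid_step` and the numerical values are illustrative assumptions.

```python
def soft_threshold(v, thresh):
    """Shrink each entry toward zero by thresh; entries smaller than thresh become zero."""
    return [max(abs(a) - thresh, 0.0) * (1.0 if a > 0 else -1.0) for a in v]

def comid_step(theta, grad, eta, tau):
    """One COMID update (4), assuming psi = 0.5*||.||_2^2, r = tau*||.||_1,
    and no constraint set (all three are assumptions of this sketch)."""
    stepped = [a - eta * g for a, g in zip(theta, grad)]
    return soft_threshold(stepped, eta * tau)

theta = [1.0, -0.05, 0.3]
grad = [0.0, 0.0, 1.0]   # gradient of the data-fit term f_t at theta
print(comid_step(theta, grad, eta=0.5, tau=0.2))
```

The second entry is set exactly to zero rather than merely made small, which is the sparsity behavior that motivates treating $r$ separately from $f_t$.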



5. Dynamical models in online convex 
programming 

Unlike the bound in Theorem 3, tracking or shifting
regret (Cesa-Bianchi & Lugosi, 2006; Cesa-Bianchi
et al., 2012) bounds typically consider piecewise constant
comparators, where $\theta_t - \theta_{t-1} = 0$ for all but
$m$ values of $t$, where $m$ is a constant, or yield regret
bounds which scale with $\sum_t \|\theta_t - \theta_{t-1}\|$. In this paper,
we develop tracking regret bounds which are small
for much broader classes of dynamic comparator sequences.

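In particular, we propose the following alternative to
(2) and (4), which we call Dynamic Mirror Descent
(DMD). Let $\Phi_t : \Theta \to \Theta$ denote a predetermined dynamical
model, and set

$$\tilde\theta_{t+1} = \arg\min_{\theta \in \Theta}\ \eta_t \langle \nabla f_t(\hat\theta_t), \theta \rangle + \eta_t r(\theta) + D(\theta \| \hat\theta_t) \tag{5a}$$
$$\hat\theta_{t+1} = \Phi_t(\tilde\theta_{t+1}). \tag{5b}$$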

By including $\Phi_t$ in the process, we effectively search
for a predictor which (a) attempts to minimize the loss
and (b) which is close to $\tilde\theta_t$ under the transformation
of $\Phi_t$. This is similar to a stochastic filter which alternates
between using a dynamical model to update
the "state", and then uses this state to perform the
filtering action. A key distinction of our approach,
however, is that we make no assumptions about $\Phi_t$'s
relationship to the observed data.
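
The two-stage structure of DMD can be sketched on a toy problem. The example below assumes Euclidean $\psi$, an $\ell_1$ regularizer, an unconstrained domain, noiseless observations with $f_t(\theta) = \frac{1}{2}\|x_t - \theta\|_2^2$, and a one-sparse signal that shifts circularly by one position per step; none of these choices come from the experiments in this paper, and all names are hypothetical.

```python
def soft_threshold(v, thresh):
    """Entrywise shrinkage toward zero by thresh."""
    return [max(abs(a) - thresh, 0.0) * (1.0 if a > 0 else -1.0) for a in v]

def dmd_step(theta_hat, grad, eta, tau, phi):
    """One DMD round under the assumptions above: the composite step (5a),
    followed by an application of the dynamical model (5b)."""
    stepped = [a - eta * g for a, g in zip(theta_hat, grad)]
    theta_tilde = soft_threshold(stepped, eta * tau)   # (5a)
    return phi(theta_tilde)                            # (5b)

def shift(v):
    """Hypothetical dynamical model: circular shift by one position."""
    return v[-1:] + v[:-1]

def identity(v):
    """No dynamics; with this model the update is a plain COMID step."""
    return list(v)

def run(phi, steps=50, dim=8, eta=0.5, tau=0.01):
    """Cumulative data-fit loss when the true signal shifts at every step."""
    truth = [1.0] + [0.0] * (dim - 1)
    theta_hat = [0.0] * dim
    total = 0.0
    for _ in range(steps):
        x = truth  # noiseless observation of the current truth
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x, theta_hat))
        grad = [b - a for a, b in zip(x, theta_hat)]  # gradient of f_t at theta_hat
        theta_hat = dmd_step(theta_hat, grad, eta, tau, phi)
        truth = shift(truth)
    return total

print(run(shift))     # small: predictions move with the signal
print(run(identity))  # large: predictions always lag the signal
```

In this toy setting the prediction made with the matching dynamics is already positioned where the signal will be, so its loss decays, while the static variant pays roughly the same loss at every step.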

Our approach effectively includes dynamics into the 
COMID approach. Indeed, for a case with no dynamics,
so that $\Phi_t(\theta) = \theta$ for all $\theta$ and $t$, our method is
equivalent to COMID. Rather than considering COMID,
we might have used other online optimization
algorithms, such as the Regularized Dual Averaging 
(RDA) method (Xiao, 2010), which has been shown 
to achieve similar performance with more regularized 
solutions. However, to the best of our knowledge, 
no tracking or shifting regret bounds have been de- 
rived for dual averaging methods (regularized or oth- 
erwise). Recent results on the equivalence of COMID 
and RDA (McMahan, 2011) suggest that the bounds 
derived here might also hold for a variant of RDA, but 
proving this remains an open problem. 

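Our main result uses the following definitions:

$$G_\ell \triangleq \max_{\theta \in \Theta,\ \ell \in \mathcal{L}} \|\nabla \ell(\theta)\|, \qquad M \triangleq \frac{1}{2} \max_{\theta \in \Theta} \|\nabla \psi(\theta)\|,$$
$$D_{\max} \triangleq \max_{\theta, \theta' \in \Theta} D(\theta' \| \theta), \quad \text{and} \quad \Delta_\Phi \triangleq \max_{t,\ \theta, \theta' \in \Theta} D(\Phi_t\theta \| \Phi_t\theta') - D(\theta \| \theta').$$

Theorem 4. Let $\Phi_t$ be a dynamical model such that
$\Delta_\Phi \le 0$. Let the sequence $\hat\theta_t$ be as in (5b), and
let $\boldsymbol{\theta}_T$ be an arbitrary sequence in $\Theta^T$. Then the Dynamic
Mirror Descent (DMD) algorithm using a non-increasing
series $\eta_{t+1} \le \eta_t$ gives

$$R_T(\boldsymbol{\theta}_T) \le \frac{D_{\max}}{\eta_{T+1}} + \frac{4M}{\eta_T} V_\Phi(\boldsymbol{\theta}_T) + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t \tag{6}$$

$$\text{with } V_\Phi(\boldsymbol{\theta}_T) \triangleq \sum_{t=1}^{T} \|\theta_{t+1} - \Phi_t(\theta_t)\|, \tag{7}$$

where $V_\Phi(\boldsymbol{\theta}_T)$ measures variations or deviations of the
comparator sequence $\boldsymbol{\theta}_T$ from the dynamical model $\Phi_t$.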
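
The condition $\Delta_\Phi \le 0$ can be probed numerically for a candidate model. The sketch below assumes a linear model $\Phi(\theta) = A\theta$ and the Euclidean divergence $D(x\|y) = \frac{1}{2}\|x - y\|_2^2$, in which case the quantity inside the maximum is $\frac{1}{2}(\|A(\theta - \theta')\|^2 - \|\theta - \theta'\|^2)$; it is a Monte Carlo lower estimate over a bounded box, not a proof, and all names are hypothetical. Under these assumptions the condition is governed by how much $A$ can stretch a vector, so a shear matrix, whose eigenvalues all equal one, can still violate it.

```python
import random

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def euclid_div(x, y):
    """Bregman divergence of psi = 0.5*||.||_2^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

def estimate_delta_phi(A, trials=2000, seed=0):
    """Largest sampled value of D(Ax||Ay) - D(x||y) over the box [-1, 1]^d."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in A[0]]
        y = [rng.uniform(-1, 1) for _ in A[0]]
        gap = euclid_div(matvec(A, x), matvec(A, y)) - euclid_div(x, y)
        worst = max(worst, gap)
    return worst

print(estimate_delta_phi([[0.9, 0.0], [0.0, 0.5]]))  # never positive: a contraction
print(estimate_delta_phi([[1.0, 0.0], [0.0, 1.0]]))  # zero: identity, i.e. COMID
print(estimate_delta_phi([[1.0, 1.0], [0.0, 1.0]]))  # positive: shear stretches some directions
```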

Note that when $\Phi_t$ corresponds to an identity operator,
the bound in Theorem 4 corresponds to existing tracking
or shifting regret bounds (Cesa-Bianchi & Lugosi,
2006; Cesa-Bianchi et al., 2012). The condition that
$\Delta_\Phi \le 0$ is similar to requiring that $\Phi_t$ be a contraction
mapping. This restriction is important; without
it, any poor prediction made at one time step could
be magnified by repeated application of the dynamics.
Additive models and matrix multiplications with
all eigenvalues less than or equal to unity satisfy this
restriction. Notice also that if $\Phi_t = I$ for all $t$, the theorem
gives a novel tracking regret bound for COMID.
To prove Theorem 4, we employ the following lemma, 
which is proven in Section 9. 

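Lemma 5. Let the sequence $\hat\theta_t$ be as in (5b), and let
$\boldsymbol{\theta}_T$ be an arbitrary sequence in $\Theta^T$; then

$$\ell_t(\hat\theta_t) - \ell_t(\theta_t) \le \frac{1}{\eta_t} \left[ D(\theta_t \| \hat\theta_t) - D(\theta_{t+1} \| \hat\theta_{t+1}) \right] + \frac{\Delta_\Phi}{\eta_t} + \frac{4M}{\eta_t} \|\theta_{t+1} - \Phi_t(\theta_t)\| + \frac{\eta_t}{2\sigma} G_\ell^2.$$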



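Proof of Theorem 4: The proof is a matter of summing
the bounds of Lemma 5 over time. For simplicity
denote $\Delta_t \triangleq D(\theta_t \| \hat\theta_t)$ and $V_t \triangleq \|\theta_{t+1} - \Phi_t(\theta_t)\|$. Then, using $\Delta_\Phi \le 0$,

$$R_T(\boldsymbol{\theta}_T) \le \sum_{t=1}^{T} \left( \frac{\Delta_t}{\eta_t} - \frac{\Delta_{t+1}}{\eta_{t+1}} \right) + \sum_{t=1}^{T} \Delta_{t+1} \left( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \right) + \sum_{t=1}^{T} \frac{4M}{\eta_t} V_t + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t$$
$$\le \frac{\Delta_1}{\eta_1} + D_{\max} \left( \frac{1}{\eta_{T+1}} - \frac{1}{\eta_1} \right) + \frac{4M}{\eta_T} V_\Phi(\boldsymbol{\theta}_T) + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t$$
$$\le \frac{D_{\max}}{\eta_{T+1}} + \frac{4M}{\eta_T} V_\Phi(\boldsymbol{\theta}_T) + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t.$$

We set $\eta_t$ using the doubling trick (Cesa-Bianchi & Lugosi,
2006) whereby time is divided into increasingly
longer segments, and on each interval a temporary
time horizon is fixed, known, and used to determine
an optimal step size (generally proportional to the inverse
of the square root of the time horizon). This
approach yields the regret bound:

$$R_T(\boldsymbol{\theta}_T) = O\left( \sqrt{T} \left[ 1 + V_\Phi(\boldsymbol{\theta}_T) \right] \right).$$

This proof shares some ideas with the tracking regret
bounds of (Zinkevich, 2003), but uses properties of the
Bregman Divergence to eliminate some terms, while
additionally incorporating dynamics.

6. Prediction with a family of
dynamical models

DMD in the previous section uses a single dynamical
model. In practice, however, we do not know the best
dynamical model to use, or the best model may change
over time in nonstationary environments.

To address this challenge, we assume a finite set of
candidate dynamical models $\{\Phi^{(1)}, \Phi^{(2)}, \ldots, \Phi^{(N)}\}$, and
describe a procedure which uses this collection to
adapt to nonstationarities in the environment. In particular,
we establish tracking regret bounds for a comparator
class with different dynamical models on different
time intervals. This class, $\Theta_m$, can be described
as all predictors defined on $m + 1$ segments $[t_k, t_{k+1} - 1]$
with time points $1 = t_1 < \cdots < t_{m+2} = T + 1$. For a
given $\boldsymbol{\theta}_T \in \Theta_m$ and $k = 1, \ldots, m + 1$, let

$$V^{(m+1)}(\boldsymbol{\theta}_T) \triangleq \sum_{k=1}^{m+1} \min_{i_k \in \{1, \ldots, N\}} \sum_{t=t_k}^{t_{k+1}-1} \|\theta_{t+1} - \Phi_t^{(i_k)}(\theta_t)\|$$

denote the deviation of the sequence $\boldsymbol{\theta}_T$ from the best
series of $m + 1$ dynamical models.

Let $\hat\theta_t^{(i)}$ denote the output of the DMD algorithm of
Section 5 using dynamical model $\Phi^{(i)}$. Then tracking
regret can be expressed as:

$$R_T(\boldsymbol{\theta}_T) = T_1 + T_2,$$
$$T_1 \triangleq \sum_{t=1}^{T} \ell_t(\hat\theta_t) - \min_{i_1, \ldots, i_T} \sum_{t=1}^{T} \ell_t(\hat\theta_t^{(i_t)}), \qquad T_2 \triangleq \min_{i_1, \ldots, i_T} \sum_{t=1}^{T} \ell_t(\hat\theta_t^{(i_t)}) - \sum_{t=1}^{T} \ell_t(\theta_t),$$

where the minimization in the second term of $T_1$ and
first term of $T_2$ is with respect to sequences of dynamical
models with at most $m$ switches, such that
$\sum_{t=1}^{T-1} 1[i_t \ne i_{t+1}] \le m$. $T_1$ corresponds to the
tracking regret of our algorithm relative to the best
sequence of dynamical models within the DMD framework,
and $T_2$ is the regret of that sequence relative to
the best comparator in the class $\Theta_m$.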

We choose $\hat\theta_t$ by using the Fixed Share (FS) forecaster
on the DMD estimates of (5), $\hat\theta_t^{(i)}$. In FS, each expert
(here, each candidate dynamical model) is assigned a
weight that decreases with its cumulative
loss at that point yet with some weight shared
amongst all the experts, so that an expert with very
small weight can quickly regain weight to become the
leader (Cesa-Bianchi & Lugosi, 2006). Our estimate
is:

$$\tilde w_{i,t} = w_{i,t-1} \exp\left( -\eta_r\, \ell_{t-1}(\hat\theta_{t-1}^{(i)}) \right) \tag{9}$$

$$w_{i,t} = (\lambda / N) \sum_{j=1}^{N} \tilde w_{j,t} + (1 - \lambda)\, \tilde w_{i,t} \tag{10}$$

$$\hat\theta_t = \frac{\sum_{i=1}^{N} w_{i,t}\, \hat\theta_t^{(i)}}{\sum_{j=1}^{N} w_{j,t}} \tag{11}$$
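
The weighting scheme can be sketched in a few lines. The code below assumes the standard Fixed Share form described in (Cesa-Bianchi & Lugosi, 2006): an exponential reweighting by the latest loss, a sharing step that mixes a fraction $\lambda$ of the total weight uniformly across experts, and a normalized weighted average of the experts' predictions. The function names, the two-expert scenario, and the parameter values are illustrative assumptions.

```python
import math

def fixed_share_update(weights, losses, eta_r, lam):
    """One Fixed Share round: exponential loss update, then weight sharing."""
    tilde = [w * math.exp(-eta_r * loss) for w, loss in zip(weights, losses)]
    pool = sum(tilde)
    n = len(weights)
    return [lam * pool / n + (1.0 - lam) * wt for wt in tilde]

def combine(weights, predictions):
    """Normalized weighted average of per-expert predictions (lists of floats)."""
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * p[k] for w, p in zip(weights, predictions)) / total
        for k in range(dim)
    ]

# Expert 0 is accurate for 30 rounds, then expert 1 becomes accurate.
w = [0.5, 0.5]
for _ in range(30):
    w = fixed_share_update(w, [0.0, 1.0], eta_r=1.0, lam=0.05)
print([x / sum(w) for x in w])   # mass concentrated on expert 0
for _ in range(30):
    w = fixed_share_update(w, [1.0, 0.0], eta_r=1.0, lam=0.05)
print([x / sum(w) for x in w])   # mass has moved to expert 1
```

Because the sharing step keeps every expert's weight above a floor, the second expert recovers within a few rounds of the switch instead of having to overcome thirty rounds of accumulated loss.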



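Following (Cesa-Bianchi & Lugosi, 2006), we have

$$T_1 \le \frac{m+1}{\eta_r} \log N + \frac{1}{\eta_r} \log \frac{1}{\lambda^m (1-\lambda)^{T-m-1}} + \frac{\eta_r}{8} T,$$

and $T_2$ can be bounded using the method described
in Section 5 on each time interval $[t_k, t_{k+1} - 1]$ and
summing over the $m + 1$ intervals, yielding

$$T_2 \le \frac{(m+1) D_{\max}}{\eta_{T+1}} + \frac{4M}{\eta_T} V^{(m+1)}(\boldsymbol{\theta}_T) + \frac{G_\ell^2}{2\sigma} \sum_{t=1}^{T} \eta_t.$$

Letting $\eta_r = \eta_t = 1/\sqrt{T}$, the overall expected tracking
regret is thus

$$R_T(\boldsymbol{\theta}_T) = O\left( \sqrt{T} \left[ (m+1)(\log N + D_{\max}) + \log \frac{1}{\lambda^m (1-\lambda)^{T-m-1}} + V^{(m+1)}(\boldsymbol{\theta}_T) \right] \right).$$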

The last term in this bound measures the deviation of
a comparator in $\Theta_m$ from the best series of dynamical
models over $m + 1$ segments (where $m$ does not scale
with $T$). Here $\lambda$ is usually chosen to be $\frac{m}{T-1}$, where $m$ is
an upper bound on the number of switches, independent
of $T$. Again, if $T$ is not known in advance the
doubling trick can be used. Note that $V^{(m+1)}(\boldsymbol{\theta}_T) \le
V_{\Phi^{(i)}}(\boldsymbol{\theta}_T)$ for any fixed $i \in \{1, \ldots, N\}$, thus this approach
generally yields lower regret than using a fixed
dynamical model. However, we incur some loss by not
knowing the optimal number of switches $m$ or when
the switching times are; these are accounted for in $T_1$.
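
The doubling trick mentioned above can be illustrated by the step-size schedule it induces. The sketch below assumes epochs of length 1, 2, 4, and so on, with a step size proportional to the inverse square root of the epoch length; the function name and the `scale` parameter are hypothetical, and the sketch produces only the schedule (a full implementation would also decide whether to restart the learner at each epoch boundary).

```python
import math

def doubling_step_sizes(num_rounds, scale=1.0):
    """Step sizes eta_t = scale / sqrt(horizon) on epochs of length 1, 2, 4, ..."""
    etas = []
    horizon = 1
    while len(etas) < num_rounds:
        take = min(horizon, num_rounds - len(etas))
        etas.extend([scale / math.sqrt(horizon)] * take)
        horizon *= 2
    return etas

print(doubling_step_sizes(7))
# The schedule is non-increasing, and its sum grows on the order of sqrt(T).
print(sum(doubling_step_sizes(1023)) / math.sqrt(1023))
```

The resulting schedule is non-increasing, as required by Theorem 4, and its sum stays within a constant factor of $\sqrt{T}$ without knowing $T$ in advance.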

We use the Fixed Share algorithm as a means to amalgamate
estimates with different dynamics; however,
other methods could be used with various tradeoffs.
The Fixed Share algorithm, for instance, has linear 
complexity with low regret, but with respect to a com- 
parator class with fixed number of switches. Other al- 
gorithms can accommodate larger classes of experts, 
or not assume knowledge of the number of switches, 
but come at the price of higher regret or complexity 
as explained in (Gyorgy et al., 2012).



7. Experiments and results

To demonstrate the performance of Dynamic Mirror
Descent (DMD) combined with the Fixed Share algorithm
(which we call Dynamic Fixed Share (DFS)), we
consider two scenarios: reconstruction of a dynamic
scene (i.e., video) from sequential compressed sensing
observations, and tracking connections in a dynamic
social network.

7.1. Compressive video reconstruction

To test DMD, we construct a video which contains an
object moving in a 2-dimensional plane; the $t$-th frame
is denoted $\theta_t$ (a 150 x 150 image stored as a length-22500
vector) which takes values between 0 and 1. The
corresponding observation is $x_t = A_t \theta_t + n_t$, where $A_t$
is a random 500 x 22500 matrix and $n_t$ corresponds to
measurement noise. This model coincides with several
compressed sensing architectures (Duarte et al., 2008).
We used white Gaussian noise with variance 1.

Our loss function uses $f_t(\theta) = \frac{1}{2} \|x_t - A_t \theta\|_2^2$ and
$r(\theta) = \tau \|\theta\|_1$, where $\tau > 0$ is a tuning parameter.
We construct a family of $N = 9$ dynamical models,
where $\Phi^{(i)}(\theta)$ shifts the frame, $\theta$, one pixel in a
direction corresponding to an angle of $2\pi i/(N - 1)$
as well as a "dynamic" corresponding to no motion.
(With the static model, DMD reduces to COMID.)
The true video sequence uses different dynamical models
over $t = \{1, \ldots, 240\}$ and $t = \{241, \ldots, 500\}$. Finally,
we use $\psi(\cdot) = \frac{1}{2} \|\cdot\|_2^2$ so the Bregman Divergence
$D(x \| y) = \frac{1}{2} \|x - y\|_2^2$ is the usual squared Euclidean distance.
The DFS forecaster uses $\lambda = 0.01$.

Figure 1. Tracking dynamics using DFS and comparing individual
models for directional (N, S, E, etc.) motion. Before
$t = 240$ the NE motion dynamic model incurs small
loss, whereas after $t = 240$ the SE motion does well, and
DFS successfully tracks this change.

Figures 1 and 2 show the impact of using DFS. We see
that DFS switches between dynamical models rapidly
and outperforms all of the individual predictions, including
COMID, used as a baseline, to show the advantages
of incorporating knowledge of the dynamics.

Figure 2. Instantaneous predictions at $t = 480$. Top Left:
$\theta_t$. Top Right and Bottom Left: DMD predictions $\hat\theta_t^{(i)}$
under two of the individual dynamical models. Bottom Right:
$\hat\theta_t$. The prediction made with the prevailing motion is an
accurate representation of the ground truth, while the prediction
with the wrong dynamic is an unclear picture. The
DFS algorithm correctly picks out the cleaner picture.
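
The family of shift dynamics used in Section 7.1 can be sketched as follows. The code assumes frames stored as lists of rows, circular wrap-around at the image boundary, and the convention that a positive row offset moves content toward higher row indices; these conventions and the function names are assumptions of the sketch, since the boundary handling is not specified in the text.

```python
import math

def make_shift_models(num_models=9):
    """Offsets (dy, dx): eight one-pixel directions at angles
    2*pi*i/(num_models - 1), plus a final static 'no motion' model."""
    offsets = []
    for i in range(num_models - 1):
        angle = 2.0 * math.pi * i / (num_models - 1)
        offsets.append((round(math.sin(angle)), round(math.cos(angle))))
    offsets.append((0, 0))
    return offsets

def apply_shift(frame, offset):
    """Circularly shift a 2-D frame (list of rows) by offset = (dy, dx)."""
    dy, dx = offset
    rows, cols = len(frame), len(frame[0])
    return [
        [frame[(r - dy) % rows][(c - dx) % cols] for c in range(cols)]
        for r in range(rows)
    ]

offsets = make_shift_models()
frame = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
print(offsets)
print(apply_shift(frame, offsets[0]))  # the bright pixel moves one column over
```

Each offset defines one candidate model; running one DMD instance per offset and combining them with Fixed Share gives the DFS forecaster described above.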



7.2. Tracking dynamic social networks 

Dynamical models have a rich history in the context 
of social network analysis (Snijders, 2001), but we are 
unaware of their application in the context of online 
learning algorithms. To show how DMD can bridge 
this gap, we track the influence matrix of seats in 
the US Senate from 1795 to 2011 using roll call data 
(http://www.voteview.com/dwnl.htm). At time t, we 
observe the "yea" or "nay" vote of each Senator, which 
we represent with a +1 or —1. When a Senator's vote 
is unavailable (for instance, before a state joined the 
union), we use a 0. We form a length p = 100 vector 
of these votes indexed by the Senate seat, and denote
this $x_t$.

Following (Ravikumar et al., 2010), we form a loss 
function using a negative log Ising model pseudolikeli- 
hood to sidestep challenging issues associated with the 
partition function of the Ising model likelihood. For 
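a social network with $p$ agents, $\theta_t \in [-1, 1]^{p \times p}$, where
$(\theta_t)_{ab}$ corresponds to the correlation in voting patterns
between agents $a$ and $b$ at time $t$. Let $V$ denote the
set of agents, $V \setminus a$ the set of all agents except $a$, $x_a$
the vote of agent $a$, and $\theta_a = \{\theta_{ab} : b \in V\}$. Our loss
function is

$$\varphi(\theta_a; x) = \log\left[ \exp\left( 2\theta_{aa} x_a + 2 \sum_{b \in V \setminus a} \theta_{ab} x_a x_b \right) + 1 \right]$$
$$f^{(a)}(\theta_a; x) = -2\theta_{aa} x_a - 2 \sum_{b \in V \setminus a} \theta_{ab} x_a x_b + \varphi(\theta_a; x)$$
$$f(\theta; x) = \sum_{a \in V} f^{(a)}(\theta_a; x)$$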



and $r(\theta) = \tau \|\theta\|_1$, where $\tau > 0$ is a tuning parameter;
this loss is convex in $\theta$. We set $\psi(\theta) = \frac{1}{2} \|\theta\|_F^2$ and use
a dynamical model inspired by (Snijders, 2001), where
if $|\theta_{ac^*} \theta_{bc^*}| > |\theta_{ab}|$, with $c^* = \arg\max_c |\theta_{ac} \theta_{bc}|$, then:

$$\left( \Phi^{(i)}(\theta) \right)_{ab} = (1 - \alpha_i)\, \theta_{ab} + \alpha_i\, \theta_{ac^*} \theta_{bc^*}.$$

Otherwise, $(\Phi^{(i)}(\theta))_{ab} = \theta_{ab}$. The intuition is that if two
members of the network share a strong common connection,
they will become connected in time. We set
$\alpha_i \in \{0, .001, .002, .003, .004\}$ for the different dynamical
models. We set $\tau = .1$ and again set $\eta$ using the
doubling trick with time horizons set at increasing
powers of 10. As in (Langford et al., 2009), we find
that regularizing (e.g., thresholding) every 10 steps,
instead of at each time step, allows for the values to
grow above the threshold for meaningful relationships
to be found.
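
The pseudolikelihood loss and the common-connection dynamic can be sketched together. The code below evaluates the loss through the identity $\log(e^z + 1) - z = \log(1 + e^{-z})$ for numerical stability. In the dynamic it assumes that the maximizing agent $c^*$ ranges over agents other than $a$ and $b$ and that diagonal entries are left unchanged; neither detail is stated in the text, and the function names are hypothetical.

```python
import math

def node_loss(theta, x, a):
    """Negative log pseudolikelihood of agent a: log(exp(z) + 1) - z, with
    z = 2*theta[a][a]*x[a] + 2*sum over b != a of theta[a][b]*x[a]*x[b]."""
    z = 2.0 * theta[a][a] * x[a]
    z += 2.0 * sum(theta[a][b] * x[a] * x[b] for b in range(len(x)) if b != a)
    # Stable evaluation of log(1 + exp(-z)).
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)

def pseudo_loss(theta, x):
    """Total loss: sum of the per-agent terms."""
    return sum(node_loss(theta, x, a) for a in range(len(x)))

def closure_dynamics(theta, alpha):
    """Pull theta[a][b] toward theta[a][c]*theta[b][c] for the strongest common
    neighbor c (assumed distinct from a and b) when that product dominates."""
    p = len(theta)
    out = [row[:] for row in theta]
    for a in range(p):
        for b in range(p):
            others = [c for c in range(p) if c != a and c != b]
            if a == b or not others:
                continue
            c_star = max(others, key=lambda c: abs(theta[a][c] * theta[b][c]))
            prod = theta[a][c_star] * theta[b][c_star]
            if abs(prod) > abs(theta[a][b]):
                out[a][b] = (1.0 - alpha) * theta[a][b] + alpha * prod
    return out

theta = [[0.0, 0.0, 0.8],
         [0.0, 0.0, 0.5],
         [0.8, 0.5, 0.0]]
print(pseudo_loss(theta, [1, 1, 1]))
print(closure_dynamics(theta, 0.1)[0][1])  # agents 0 and 1 drift together via agent 2
```

With `alpha` set to zero the dynamic leaves the matrix unchanged, which is the COMID case among the candidate models.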

Figure 3. Tracking a dynamic social network. Losses for
different dynamical models and the DFS predictions; $\alpha = 0$
corresponds to COMID.

Figure 4. Losses for individual senators. Low losses correspond
to predictable, consistent voting behavior, while
higher loss means less predictable

Figure 3 shows the average per round loss of each
model, and the DFS estimator over a 30 year time 
window. We see that applying the dynamical model 
improves performance relative to COMID (ai = 0) 
and that DFS aggregates the predictions successfully. 
Figure 4 shows the moving average losses for a few 
Senators, where high loss corresponds to behavior un- 
expected in the model. Notice that John Kerry (D- 
MA) has generally low loss, spikes around 2006, and 
then drops again before a reelection campaign in 2008. 

Looking at the network estimates of DFS across time 
(as in Figure 5) we can see tight factions forming in the 
mid- to late-1800s (post Civil War), followed by a time 
when the factions dissipate in the mid-1900s during the 
Civil Rights Movement. Finally, we see factions again 
forming in more recent times. The seats are sorted sep- 
arately for each matrix to emphasize groupings, which 
align with known political factions. 

Figure 5. Influence matrices for select years spanning Civil
War and Civil Rights Movement to present, showing for- 
mation of factions. Warmer colors (reds and greens) cor- 
respond to higher influence and colder (blue) corresponds 
to lower influence. (Best viewed in color) 



8. Conclusion and future directions 

In this paper we have proposed a novel online op- 
timization method, called Dynamic Mirror Descent 
(DMD), which incorporates dynamical model state up- 
dates. There is no assumption that there is a "true" 
known underlying dynamical model, or that the best 
dynamical model is unchanging with time. The pro- 
posed Dynamic Fixed Share (DFS) algorithm adap- 
tively selects the most promising dynamical model 
from a family of candidates at each time step. Recent
work on shifting or tracking regret bounds for
online convex optimization further suggests that the
techniques developed in this paper may also be useful
for bounding adaptive regret or developing methods
for automatically tuning step-size parameters (Cesa-Bianchi
et al., 2012). In experiments with real and
simulated data, DMD shows strong tracking behavior 
even when underlying dynamical models are switching. 

9. Proofs 

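Proof of Lemma 5:

The optimality condition of (5a) implies

$$\langle \nabla f_t(\hat\theta_t) + \nabla r(\tilde\theta_{t+1}),\ \tilde\theta_{t+1} - \theta_t \rangle \le \frac{1}{\eta_t} \langle \nabla\psi(\hat\theta_t) - \nabla\psi(\tilde\theta_{t+1}),\ \tilde\theta_{t+1} - \theta_t \rangle. \tag{13}$$

Using this condition we can bound the instantaneous
regret as follows:

$$\ell_t(\hat\theta_t) - \ell_t(\theta_t) = f_t(\hat\theta_t) - f_t(\theta_t) + r(\hat\theta_t) - r(\tilde\theta_{t+1}) + r(\tilde\theta_{t+1}) - r(\theta_t)$$
$$\le \langle \nabla f_t(\hat\theta_t), \hat\theta_t - \theta_t \rangle + \langle \nabla r(\tilde\theta_{t+1}), \tilde\theta_{t+1} - \theta_t \rangle + \langle \nabla r(\hat\theta_t), \hat\theta_t - \tilde\theta_{t+1} \rangle \tag{14a}$$
$$\le \frac{1}{\eta_t} \langle \nabla\psi(\hat\theta_t) - \nabla\psi(\tilde\theta_{t+1}),\ \tilde\theta_{t+1} - \theta_t \rangle + \langle \nabla \ell_t(\hat\theta_t), \hat\theta_t - \tilde\theta_{t+1} \rangle \tag{14b}$$
$$= \frac{1}{\eta_t} \left[ D(\theta_t \| \hat\theta_t) - D(\theta_{t+1} \| \hat\theta_{t+1}) \right] + T_3 + T_4/\eta_t + T_5/\eta_t, \tag{14c}$$

where

$$T_3 \triangleq -\frac{1}{\eta_t} D(\tilde\theta_{t+1} \| \hat\theta_t) + \langle \nabla \ell_t(\hat\theta_t), \hat\theta_t - \tilde\theta_{t+1} \rangle,$$
$$T_4 \triangleq D(\Phi_t(\theta_t) \| \Phi_t(\tilde\theta_{t+1})) - D(\theta_t \| \tilde\theta_{t+1}) \le \Delta_\Phi,$$
$$T_5 \triangleq D(\theta_{t+1} \| \hat\theta_{t+1}) - D(\Phi_t(\theta_t) \| \hat\theta_{t+1}).$$

Here, (14a) follows from the convexity of $f_t$ and $r$,
(14b) follows from the optimality condition of (5a),
and (14c) follows from (3c) and adding and subtracting
terms using the equivalence (5b). Each term can be
bounded, and then combined to complete the proof.

$$T_3 \le -\frac{\sigma}{2\eta_t} \|\tilde\theta_{t+1} - \hat\theta_t\|^2 + \frac{\sigma}{2\eta_t} \|\tilde\theta_{t+1} - \hat\theta_t\|^2 + \frac{\eta_t}{2\sigma} G_\ell^2 = \frac{\eta_t}{2\sigma} G_\ell^2 \tag{15a}$$

$$T_5 = \psi(\theta_{t+1}) - \langle \nabla\psi(\hat\theta_{t+1}), \theta_{t+1} - \hat\theta_{t+1} \rangle - \psi(\Phi_t(\theta_t)) + \langle \nabla\psi(\hat\theta_{t+1}), \Phi_t(\theta_t) - \hat\theta_{t+1} \rangle$$
$$= \psi(\theta_{t+1}) - \psi(\Phi_t(\theta_t)) - \langle \nabla\psi(\hat\theta_{t+1}), \theta_{t+1} - \Phi_t(\theta_t) \rangle$$
$$\le 4M \|\theta_{t+1} - \Phi_t(\theta_t)\| \tag{15b}$$

where (15a) is due to the strong convexity of the Bregman
Divergence and Young's inequality and (15b) is
due to the convexity of $\psi$ and the Cauchy-Schwarz inequality.
Combining these inequalities with (14c) gives
the Lemma as it is stated. □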